ABSTRACT

NVD is one of the most popular databases used by researchers
to conduct empirical research on vulnerability data sets.
Our recent analysis of the Chrome vulnerability data reported
by NVD revealed an abnormal phenomenon: almost all
vulnerabilities appear to originate from the first versions.
This inspired an experiment to validate the reliability of the
NVD vulnerable version data. In this experiment, we verify,
for each version of Chrome that NVD claims to be vulnerable,
whether it is actually vulnerable. The experiment revealed
several errors in the vulnerability data of Chrome. Furthermore,
we analyzed how these errors might impact the conclusions of
an empirical study on foundational vulnerabilities. Our results
show that different conclusions could be obtained due to the
data errors.

1. INTRODUCTION 

The last few years have seen significant interest in empirical
research on data sets of vulnerabilities. Public third-party
vulnerability databases, such as Bugtraq [11], ISS/XForce [3],
the National Vulnerability Database (NVD) [5], and the Open
Source Vulnerability Database (OSVDB) [8], are mostly
preferred by researchers due to their diversity, availability,
and popularity. Among these, NVD is one of the most popular.
The CVE-ID, i.e. the identifier of each NVD entry, is usually
used as a common vulnerability identifier among other
third-party data sources. In this type of research, the quality
of the data sources plays a crucial role. If the data sources
contain wrong data, any conclusion derived from them may be
invalid.

Our research started from an abnormality in the data
when we analyzed the NVD. According to our analysis of
NVD data, almost all vulnerabilities in Chrome v2-v12
originated from version v1.0. To explain this, the following
scenarios might occur: either more vulnerabilities in newer
versions have yet to be detected, or there is a problem in the
vulnerability data of Chrome, or both.

*This work is supported by the European Commission under
the project EU-SEC-CP-SECONOMICS.

The analysis was based on an NVD data feature called
'vulnerable software and versions' (or vulnerable versions for
short). This feature marks the versions of particular
applications that are vulnerable to the vulnerability described
in the entry. For example, CVE-2008-7294 lists all Chrome
versions before v3.0.195.24 in its vulnerable versions: this
means that the vulnerability affects Chrome v3.0 and all
earlier versions. According to an archived document^1,
the information reported in this feature is "obtained from
various public and private sources. Much of this information
is obtained (with permission) from CERT, Security Focus and
ISS X-Force". Furthermore, our private communications with
the National Institute of Standards and Technology (NIST),
host of NVD, and with software vendors have revealed
a "paradox": NIST claimed that vulnerable versions were taken
from software vendors, whereas software vendors claimed
they did not know about this information. In other words,
the original source of this feature is unknown to the public,
and therefore its quality is unclear.

This raises a major threat to the validity of studies that rely
on this feature, such as [4, 7, 9, 10, 13], and possibly others.
We believe this is a strong motivation to check the
reliability of NVD.

1.1 Contribution 

The major contributions of this work are as follows: 

• We present a replicable experiment to validate the
reliability of the "vulnerable software and versions" feature
of NVD for Chrome. This experiment can be applied to
other open source applications (e.g., Firefox, Linux).

• We show that the error rates of vulnerabilities in Chrome
versions are significant. The errors consist of vulnerabilities
erroneously reported for both past and future versions.

The rest of the paper is organized as follows. We present
our research question and hypothesis (§2). After that we
describe the validation method (§3) that we follow to conduct
the experiment. Next, we report our results and analyze
the collected data (§4). We also discuss the biases that
might affect our study and how to mitigate them (§5).
Next we briefly review the studies most closely related to our
work (§6). Finally, we conclude the paper and discuss
future work (§7).

^1 This page has been removed, but can be accessed at
http://web.archive.org/web/20021201184650/http://icat.nist.gov/icat_documentation.htm



2. RUNNING EXAMPLE AND RESEARCH 
QUESTION 

We elaborate a running example on foundational
vulnerabilities of Chrome to study the impact of the
(un)reliability of NVD data. A foundational vulnerability [9]
is one that was introduced in the very first version of a
software product (i.e. v1.0), but discovered later in newer
versions. In theory, foundational vulnerabilities have a higher
chance of being exploited than others because they are exposed
to attack for longer. By finding these vulnerabilities in v1.0,
attackers could use them to exploit recent versions (say, v20)
at the release date. As a result, foundational vulnerabilities
are a source of zero-day exploits.

By June 2012, NVD reported 539 vulnerabilities for 12
stable versions^2 of Chrome^3. Out of these, 460 (85.3%) are
reported as foundational. Figure 1 depicts the fraction of
foundational vulnerabilities of Chrome. Clearly, each analyzed
version of Chrome is rife with foundational vulnerabilities:
on average, 99.5% of the vulnerabilities of a version are
foundational. We find it unlikely that Chrome developers
introduced many vulnerabilities in the first version but none
in the subsequent 11 versions. This motivates the following
research question.

RQ1 To what extent is the 'vulnerable versions' feature of
the data reported by NVD trustworthy?

To answer this question, for each pair of NVD entry and
software version listed in the vulnerable versions data
feature, we verify whether the NVD entry impacts the
corresponding version or not. If it does not, this pair is an
error. The ratio of the number of errors to the number of pairs
is the error rate, which we use as an indicator of unreliability.
In many cases, a small error rate is acceptable. Depending
on the type of study, the acceptable threshold of errors may
vary. Here we choose a threshold of 5%, which is normally
considered a threshold for statistical significance. We
consider the error rate significant if the median of the error
rates in individual versions is significantly greater than 5%.
We test the median of the error rates rather than the mean
because a previous study [4] has shown that vulnerabilities do
not follow the normal distribution. Hence, we test the following
hypothesis:

H1 The median of error rates for vulnerabilities reported in
Chrome versions is greater than 5%.

We employ the non-parametric tests for the median to 
check for the significance. 

3. VALIDATION METHOD 

To verify 539 vulnerabilities for 12 versions of Chrome, we
need to check 5,158 pairs of vulnerability and version. Such
a large number of pairs makes manual verification infeasible.
Additionally, a manual approach is not replicable. Thus,
our proposed method is based on a repeatable and automatic
approach [12] where security bug fixes are traced back to
the code base to locate the vulnerable code responsible for
vulnerabilities. Then we can determine whether a version
claimed as vulnerable is actually vulnerable.

^2 http://omahaproxy.appspot.com/about, visited in July
2012. This is a web application supported by the Google team
for tracking releases.

^3 We only consider versions that are at least one year old, to
allow their vulnerability data to mature.

Figure 1: Foundational vulnerabilities in Chrome. The stacked
bars above show the fraction of foundational and
non-foundational vulnerabilities; the release dates of the
versions are shown below.

Figure 2: Validation method overview.

Our method relies on the following assumptions.

ASS1 When developers commit a bug-fix, they denote the
bug ID in the commit message.

ASS2 If the fragment of code responsible for a vulnerability
is not there, then the software is not vulnerable.

ASS3 If a vulnerability is fixed by only adding code to a
vulnerable file, all versions containing the non-fixed
revision of the vulnerable file are vulnerable.

By vulnerable files we mean the files developers changed
to fix a vulnerability. Also, by code responsible for a
vulnerability we mean the code that developers changed to fix
the vulnerability. In some cases, the changed code might
not be the vulnerable code itself, but changing it helped to
remove the vulnerability even though the original buggy code
was not edited. For example, a vulnerability that could lead to
an SQL injection attack could be fixed by inserting a sanitizer
around the source in another module. However, the absence of
such a sanitizer does not necessarily mean the application is
vulnerable to the same SQL injection attack. Even though the
changed code is not buggy in some cases, we still stretch the
concept and call it the code responsible.

Figure 2 sketches the steps of the proposed method. The
input of the process is a list of vulnerabilities and the output
is a list of vulnerabilities annotated with vulnerable versions.
The details are as follows:

STEP 1 Repository mining. This step takes the list of
vulnerabilities and the commit log (i.e. list of commits)
generated by the repository to produce commits of security
bug fixes. A commit of security bug-fixes is one that mentions
a security bug ID^4 in its commit message, following some
special patterns. These patterns may vary across software
projects. For Chrome, they are BUG=n(,n)* or
BUG=http://crbug.com/n, where n is the bug ID.

    $ svn diff -c 95731 url_fixer_upper.cc
    @@ -540,3 +540,5 @@     <- start line index and number of lines
                               of the left and the right revisions
       bool is_file = true;
       GURL gurl(trimmed);
    +  if (gurl.is_valid() && ...)      <- added line, preceded by a '+'
    +    is_file = false;
       FilePath full_path;
    -  if (!ValidPathForFile(...) {     <- deleted line, preceded by a '-'
    +  if (is_file && !ValidPathForFile(...) {

Figure 3: An excerpt of the diff of two revisions.

    $ svn annotate -r 95730 url_fixer_upper.cc
    line  committed revision  committer            content
    537:  15                  initial.commit       PrepareStringForFile...
    538:  15                  initial.commit
    539:  15                  initial.commit       bool is_file = true;
    541:  8536                estade@chromium.org  FilePath full_path;
    542:  15                  initial.commit       if (!ValidPathForFile(...)) {
    543:  15                  initial.commit       // Not a path as entered.

Figure 4: An excerpt of the annotation.

STEP 2 Repository back-tracing. This step takes the commits
of security bug-fixes and the annotated source files from
the repository to produce revision-annotated responsible
lines of code (LoC). For each source file f in each
commit, let r_fixed be the revision of this commit. We
compare revision r_fixed to revision r_fixed - 1 of file
f using the diff command supported by the repository.
The comparison output is in unified diff format,
as exemplified by Figure 3, where we compare revisions
r95730 and r95731 of file url_fixer_upper.cc. By
definition, responsible LoC appear in r_fixed - 1, but
not in r_fixed. For instance, from Figure 3, the
responsible LoC is {542}. We ignore trivial responsible LoC
such as empty lines, or lines that contain only '{' or '}'.

Next, we execute the annotate command on r_fixed - 1 of
file f to obtain the revisions of the responsible LoC.
Figure 4 presents an excerpt of the annotated file
url_fixer_upper.cc. We see that the revision of LoC 542 is r15.

There is a special case where the comparison between
r_fixed and r_fixed - 1 contains no line preceded by the
minus sign. It means developers fixed the vulnerability
by adding code only (e.g., a security check). In this case
we assume that all versions containing revision r_fixed - 1
and lower are vulnerable (see ASS3).

STEP 3 Responsible code scanning. This step looks for each 
revision-annotated responsible LoC in the code base 
of every version. If found, we append the correspond- 
ing version and the LoC to a list of version-annotated 
responsible LoC (see ASS2). From this list, we can 
identify vulnerable versions for each vulnerability. 
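The three steps above can be illustrated with a minimal Python sketch. It is not the tool used in the experiment: the function names, the single-file diff restriction, and the in-memory representation of the code base (a mapping from each version to the set of (origin revision, line text) pairs of the vulnerable file) are assumptions made only for illustration.

```python
import re

# STEP 1 (sketch): find bug IDs in "BUG=" lines of a commit message.
BUG_LINE = re.compile(r"^\s*BUG\s*=\s*(.+)$", re.MULTILINE)
BUG_ID = re.compile(r"(?:http://crbug\.com/)?(\d+)")


def extract_bug_ids(commit_message):
    """Return the bug IDs named as BUG=n(,n)* or BUG=http://crbug.com/n."""
    ids = set()
    for value in BUG_LINE.findall(commit_message):
        for token in value.split(","):
            match = BUG_ID.fullmatch(token.strip())
            if match:
                ids.add(int(match.group(1)))
    return ids


# STEP 2 (sketch): line numbers of non-trivial deleted lines in a
# single-file unified diff, numbered as in the pre-fix revision.
HUNK = re.compile(r"^@@ -(\d+)(?:,\d+)? \+\d+(?:,\d+)? @@")
TRIVIAL = {"", "{", "}"}


def responsible_lines(diff_text):
    result = []
    old_line = None  # position in the left (pre-fix) revision
    for line in diff_text.splitlines():
        hunk = HUNK.match(line)
        if hunk:
            old_line = int(hunk.group(1))
        elif old_line is None or line.startswith(("+", "\\")):
            continue  # file headers, added lines, "\ No newline" marker
        elif line.startswith("-"):
            if line[1:].strip() not in TRIVIAL:
                result.append(old_line)
            old_line += 1
        else:
            old_line += 1  # context line, present in both revisions
    return result  # empty list: the fix only added code, so ASS3 applies


# STEP 3 (sketch): look up each revision-annotated responsible line
# in the (hypothetical, pre-annotated) code base of every version.
def scan_versions(responsible_loc, annotated_code_base):
    version_annotated = [
        (version, loc)
        for version, lines in sorted(annotated_code_base.items())
        for loc in responsible_loc
        if loc in lines
    ]
    vulnerable = sorted({version for version, _ in version_annotated})
    return version_annotated, vulnerable
```

A version is reported as vulnerable as soon as one responsible line is found in it, which mirrors ASS2; an empty result from `responsible_lines` signals the add-only case that a caller would handle according to ASS3.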

Notice that there are unverifiable vulnerabilities, for which
the method cannot verify the corresponding vulnerable
versions. This could be due to a couple of reasons. First, there
may be no corresponding security bug for a vulnerability. Second,
STEP 1 may not always be able to determine the commits of
security bug-fixes for a vulnerability.



Table 1 shows a few examples of Chrome vulnerabilities
to which we apply the method to verify their vulnerable
versions. The first two columns indicate the input of the
method: the list of NVD entries and their corresponding
bugs. We additionally annotate the vulnerable versions
reported by NVD for each entry next to the CVE-ID. The
next columns show the outputs of the steps. A dash
indicates that the data is not available, which means the
corresponding NVD entry is not verifiable.

For a better understanding, we describe how the NVD
entry 2011-2822 is verified in Table 1. This vulnerability
is reported to affect Chrome v1 up to v13. Its corresponding
bug is 72492. In STEP 1, by scanning the log, the bug fix
is found at revision r95731 of file url_fixer_upper.cc. In
STEP 2, we diff revisions r95730 and r95731 of this file (see
Figure 3). The responsible LoC is determined as {542}.
Then we annotate r95730 of the file to get the revision of
the responsible LoC, which is {r15} (see Figure 4). In STEP
3, we scan for this line in the code base of all versions, and
find it in v1 to v13. Finally, we identify the vulnerable
versions for this vulnerability, which are v1-v13.


4. RESULTS AND EXAMPLE REVISED 

We apply the proposed method to verify vulnerabilities of
major versions of Chrome from v1 to v12. By June 2012,
NVD reported 539 entries^5 that allegedly affect these
versions of Chrome. Out of these, 503 entries have links to 552
security bugs in the Chrome Issue Tracker in their references
section. The method took 16 hours to complete on a 3 x
quad-core 2.83GHz Linux machine with 4GB of RAM.

As a result, 167 NVD entries (31%) are verifiable, and
372 (69%) are unverifiable. Among the verifiable ones, 134
(81%) have errors, i.e. their verified vulnerable versions
differ from the reported ones. Among the unverifiable ones,
36 (10%) do not have corresponding bugs, and for 336 (90%)
we could not locate their commits of security bug fixes. We
have done a qualitative analysis on the latter entries and found
that they are bugs in external projects used in Chrome, e.g.,
WebKit (the HTML rendering engine), V8 (the JavaScript
engine), and so on. Therefore, the commits of their bug fixes do
not exist in the repository of Chrome. Later we will discuss
how to work around this problem as a part of future work.

In the following, we analyze the differences in vulnerabilities
across individual Chrome versions. Let $cve$ be an NVD
entry and $v$ be the version under analysis. We define:

• $V(cve)$: the set of reported vulnerable versions of $cve$.

• $V'(cve)$: the set of verified vulnerable versions of $cve$.
If $cve$ is unverifiable, $V'(cve) = \bot$.

• $verified(v) = \{cve \mid v \in V'(cve)\}$: the set of $cve$
whose responsible code is detected in version $v$.

• $erroneous(v) = \{cve \mid v \in V(cve) \wedge v \notin V'(cve)\}$:
the set of verifiable $cve$ whose responsible code is not
detected in version $v$.

• $unverifiable(v) = \{cve \mid v \in V(cve) \wedge V'(cve) = \bot\}$:
the set of unverifiable $cve$ of version $v$.



^4 Bugs that appear in the input list of vulnerabilities.

^5 Observed in July 2012.



Table 1: The execution of the method on a few Chrome vulnerabilities.

CVE-ID of NVD entry | Corresponding | Commits for bug-fixes       | Revision-annotated           | Version-annotated            | Verified vulnerable
(input)             | bug (input)   | (output of STEP 1)          | responsible LoC (STEP 2)     | responsible LoC (STEP 3)     | versions (STEP 3)
--------------------+---------------+-----------------------------+------------------------------+------------------------------+--------------------
2011-2822 (v1-v13)  | 72492         | url_fixer_upper.cc^a        | (r15, 542)                   | (v1-v13, 542)                | v1-v13
                    |               | (r95731)                    |                              |                              |
2011-4080 (v1-v8)   | 68115         | media_bench.cc^b (r70413)   | (r26072, 352), (r53193, 353) | (v3-v8, 352), (v5-v8, 353)   | v3-v8
2012-1521 (v1-v18)  | 117110        | -                           | -                            | -                            | -

^a: chrome/browser/net/url_fixer_upper.cc
^b: media/tools/media_bench/media_bench.cc




Figure 5: The errors in vulnerable version data of
NVD entries for Chrome. Panels: (a) Error Rate; (b) Types of
Error. In the box-and-whisker plots, the whiskers represent the
min and max values, the bold line in the middle is the median
value, and the lower and upper parts of the box are the
quartiles of the distribution. The blue dashed line at 0.05
shows the threshold of error rate.

For example, according to Table 1, $V(\text{2011-4080}) = \{v1, \ldots, v8\}$
and $V'(\text{2011-4080}) = \{v3, \ldots, v8\}$. Then 2011-4080 is
verified in v3, i.e. $\text{2011-4080} \in verified(v3)$, whereas it is
erroneous in v1, i.e. $\text{2011-4080} \in erroneous(v1)$.

Ignoring the unverifiable vulnerabilities, the error rate
of a version $v$ of Chrome is defined as the ratio of the number
of erroneous vulnerabilities of $v$ to the number of verifiable
vulnerabilities of $v$, as shown in the following formula:

\[ ER(v) = \frac{|erroneous(v)|}{|verified(v)| + |erroneous(v)|} \tag{1} \]

Being more optimistic, we assume that all unverifiable
vulnerabilities are correct. The error rate is then rewritten as
follows:

\[ ER'(v) = \frac{|erroneous(v)|}{|unverifiable(v)| + |verified(v)| + |erroneous(v)|} \tag{2} \]
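As a worked illustration of the set definitions and of formulas (1) and (2), the following Python sketch computes both error rates for the three entries of Table 1. The function names and the use of `None` to stand for the unverifiable value are assumptions of this sketch; three entries are of course far too few to say anything about the real rates.

```python
def partition(V, V_prime, v):
    """Return (verified(v), erroneous(v), unverifiable(v)) for version v.

    V maps each CVE to its reported versions; V_prime maps it to its
    verified versions, or to None when the entry is unverifiable.
    """
    verified = {c for c, vs in V_prime.items()
                if vs is not None and v in vs}
    erroneous = {c for c, vs in V.items()
                 if v in vs and V_prime[c] is not None
                 and v not in V_prime[c]}
    unverifiable = {c for c, vs in V.items()
                    if v in vs and V_prime[c] is None}
    return verified, erroneous, unverifiable


def error_rates(V, V_prime, v):
    """Return (ER(v), ER'(v)); a rate is None when its denominator is 0."""
    verified, erroneous, unverifiable = partition(V, V_prime, v)
    verifiable = len(verified) + len(erroneous)
    total = verifiable + len(unverifiable)
    er = len(erroneous) / verifiable if verifiable else None
    er_optimistic = len(erroneous) / total if total else None
    return er, er_optimistic


# Reported (V) and verified (V') versions taken from Table 1.
V = {
    "2011-2822": set(range(1, 14)),
    "2011-4080": set(range(1, 9)),
    "2012-1521": set(range(1, 19)),
}
V_PRIME = {
    "2011-2822": set(range(1, 14)),
    "2011-4080": set(range(3, 9)),
    "2012-1521": None,  # unverifiable entry
}
```

For v1, entry 2011-2822 is verified, 2011-4080 is erroneous, and 2012-1521 is unverifiable, so `error_rates(V, V_PRIME, 1)` yields ER = 1/2 and ER' = 1/3 on this toy input.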

Figure 5(a) shows the box plots for the distribution of the
error rates in Chrome versions. In a box plot, the whiskers
represent the min and max values, the bold line in the middle
is the median, and the lower and upper parts of the box
are the quartiles of the distribution. According to the figure,
the medians of both error rates ER and ER' are much
greater than 5%, which indicates that the number of
erroneous vulnerabilities is not negligible. This is confirmed
by the one-sided Wilcoxon signed-rank test, where the null
hypothesis is "the median of error rates is 5%" and the
alternative hypothesis is H1. The returned p-values for ER and
ER' are almost zero (2.44 · 10^-4 and 1.22 · 10^-3, respectively).
It means the error rates (both ER and ER') are unlikely to
have occurred by chance and are therefore significantly
greater than 5%.
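To make the test concrete, here is a minimal sketch of an exact one-sided Wilcoxon signed-rank test in plain Python. It assumes no zero differences and no tied absolute differences, and it is not necessarily the implementation used for the numbers above; the function name is hypothetical. With 12 samples that all lie above the hypothesized median, the smallest attainable p-value is 1/2^12 ≈ 2.44 · 10^-4, which coincides with the p-value reported for ER.

```python
def wilcoxon_greater(samples, mu0):
    """Exact p-value for H1: median(samples) > mu0 (signed-rank test).

    Assumes no sample equals mu0 and no two absolute differences tie.
    """
    diffs = [x - mu0 for x in samples]
    if any(d == 0 for d in diffs):
        raise ValueError("zero differences are not handled in this sketch")
    ranked = sorted(diffs, key=abs)
    if any(abs(a) == abs(b) for a, b in zip(ranked, ranked[1:])):
        raise ValueError("tied differences are not handled in this sketch")

    # W+ is the sum of the ranks of the positive differences.
    w_plus = sum(rank for rank, d in enumerate(ranked, start=1) if d > 0)

    # Under H0 every rank is positive with probability 1/2, so count
    # the subsets of {1..n} by their rank sum with a small DP.
    n = len(diffs)
    max_sum = n * (n + 1) // 2
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]

    # P(W+ >= observed) under the null hypothesis.
    return sum(counts[w_plus:]) / 2 ** n
```

The dynamic program enumerates all 2^n sign assignments implicitly, which is cheap for the 12 versions considered here; for larger samples or tied data, a statistics library would be the more appropriate choice.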

We break down erroneous vulnerabilities into the following
categories:

• stretched-past error (P-error): the set of erroneous
vulnerabilities whose version $v$ is older than all versions
that the NVD entries are verified to impact.

$P\text{-}error(v) = \{cve \in erroneous(v) \mid v < \min(V'(cve))\}$

• future-version error (F-error): the set of erroneous
vulnerabilities whose version $v$ is newer than all versions
that the NVD entries are verified to impact.

$F\text{-}error(v) = \{cve \in erroneous(v) \mid v > \max(V'(cve))\}$

• beta error (B-error): the set of erroneous vulnerabilities
whose corresponding NVD entries only impact
non-official versions, i.e. $V'(cve) = \emptyset$.

$B\text{-}error(v) = \{cve \in erroneous(v) \mid V'(cve) = \emptyset\}$

Figure 6: Verifiable (left) vs. Verified (right) vulnerabilities.
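The three categories can be expressed as a small classification function. This is only a sketch: the function name is hypothetical, versions are assumed to be comparable integers, and the "other" label for a version that falls inside a gap of the verified range is an addition of this sketch, since the categories above do not cover that case.

```python
def error_type(v, verified_versions):
    """Classify an erroneous pair (cve, v), given V'(cve).

    Precondition: v is reported by NVD but v is not in verified_versions.
    """
    if not verified_versions:
        return "B"  # only non-official versions are affected
    if v < min(verified_versions):
        return "P"  # stretched-past error
    if v > max(verified_versions):
        return "F"  # future-version error
    return "other"  # v lies in a gap inside the verified range
```

For entry 2011-4080 of Table 1, whose verified versions are v3 to v8, the reported version v1 is classified as a P-error.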

Similarly to (1), we calculate the stretched-past error rate,
future-version error rate, and beta error rate. Figure 5(b)
reports the distributions of these rates. The P-error is slightly
greater than the F-error, and both of them are much greater
than the B-error, which seems to be negligible. We employ the
Wilcoxon rank-sum test to compare each pair of error
categories. Since we compare one category to the other two, the
Bonferroni correction is applied, i.e. the significance level is
divided by 2: α/2 = 0.025. The test result confirms
that both P-error and F-error are significantly greater than
B-error, since the returned p-values are less than the corrected
level. The p-value of 0.03 for the comparison between P-error
and F-error exceeds the corrected level of 0.025; it can be
considered evidence (even if not significant) that P-error is
greater than F-error.

Hereafter we revisit the running example about foundational
vulnerabilities in Chrome. We assume that the same
ratio of errors applies to the unverifiable vulnerabilities.
Therefore, in the following analysis we study the impact
of the errors in verifiable vulnerabilities on the trend
of foundational vulnerabilities.

We have two data sets: Verifiable and Verified. The for- 
mer is the set of verifiable vulnerabilities that we could verify 
by the proposed method. Their vulnerable versions are re- 
ported by NVD. The latter is also the same set, but their 
vulnerable versions are verified. 

Figure 6 illustrates the fraction of foundational
vulnerabilities in each version. The left part is based on
Verifiable, and the right part is based on Verified. In this
figure, the circles denote the percentage of foundational
vulnerabilities. In the left picture, even though the absolute
number of foundational vulnerabilities decreases, their
fractions are almost unchanged. Additionally, as mentioned
above, it is strange that vulnerabilities are only introduced in
v1.0, but none are introduced in later versions. However, in the
right picture, this phenomenon disappears. Moreover, there is a
decreasing trend in the fractions of foundational vulnerabilities
from v2.0 to v12.0. We additionally employ the Wilcoxon
rank-sum test on the absolute number and the fraction of
foundational vulnerabilities in each version. The test results
show that the difference between the left and the right is
indeed not random, since the returned p-values are almost zero
(i.e. 3.82 · 10^-… and 3.84 · 10^-…, respectively).

Figure 7: Trend of foundational vulnerability discovery.
Panels: (a) Verifiable vulnerabilities; (b) Verified
vulnerabilities.

Furthermore, we replicate the analysis of the trend of
foundational vulnerability discovery as described in [9].
Figure 7 exhibits the analysis result on Verifiable (Figure 7(a))
and Verified (Figure 7(b)). In each panel, the left part is the
rate of foundational vulnerabilities discovered monthly
since the release date, and the right part is the Laplace test
for trend in monthly discovered foundational vulnerabilities.
Two dotted horizontal lines at values 1.96 and -1.96 indicate
the range such that, if a value of the Laplace factor falls
outside this range, it is significant evidence for either an
increasing (> 1.96) or a decreasing (< -1.96) trend in the
data. Such values are indicated as green (gray) circles. Again
we see a clear difference between the discovery rates of the
two data sets. The p-value of the Wilcoxon rank-sum test is
almost zero (0.0008): this indicates that the difference is
significant. We can also see the difference in the trend of
discovery. By using Verifiable, we might observe several
significant trends (both increasing and decreasing) of
foundational vulnerability discovery. Some of these trends,
however, disappear in Verified.
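The Laplace factor itself is not spelled out here. As a minimal sketch, the code below assumes the commonly used form for counts grouped into k unit intervals, u(k) = (Σ_{i=1..k} (i-1) n_i - ((k-1)/2) Σ n_i) / sqrt(((k² - 1)/12) Σ n_i), where n_i is the number of vulnerabilities discovered in month i; the exact variant used in [9] may differ, and the function name is hypothetical.

```python
import math


def laplace_factor(counts):
    """Laplace trend factor for monthly counts n_1..n_k (grouped data).

    Returns None when fewer than two intervals or no events are given.
    """
    k = len(counts)
    total = sum(counts)
    if k < 2 or total == 0:
        return None
    # enumerate() starts at 0, which matches the (i - 1) weight of month i.
    weighted = sum(i * n for i, n in enumerate(counts))
    numerator = weighted - (k - 1) / 2 * total
    denominator = math.sqrt((k * k - 1) / 12 * total)
    return numerator / denominator
```

Under this form, discoveries concentrated in late months give a factor above 1.96 (increasing trend), discoveries concentrated in early months give a factor below -1.96 (decreasing trend), and evenly spread discoveries give a factor of zero.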

In short, our experiment provides evidence that the error
in the vulnerable versions feature of NVD entries for Chrome
is not negligible. Among the errors, NVD tends to commit
more stretched-past errors than others. This is one of the
reasons for the abnormality that 99.5% of the vulnerabilities of
Chrome are foundational. This error in NVD has a significant
impact on the analysis of foundational vulnerabilities, where
different conclusions can be drawn.

5. THREAT TO VALIDITY 

Construct validity includes threats affecting the approach
by means of which we collect and verify vulnerabilities. Threats
in this category come from the following assumptions.

By making the assumption ASS1, we delegate the
completeness of our method to the responsibility of developers
and the quality control of the software vendor. According
to [2], there are two types of mistakes: the developers do not
mention the bug ID in a bug-fix commit; and the developers
mention a bug ID in a non-bug-fix commit. Also in [2], the
authors showed that the latter is negligible, while the former
does exist. To evaluate the impact of the latter mistakes,
we have done a qualitative analysis on bug-fix reports, and
we found that all analyzed bug-fix commits are actually bug
fixes. As for the former mistakes, we check the completeness
of the bug-fix commits for vulnerabilities. As discussed, we
found a large portion of vulnerabilities for which we could
not locate the bug-fix commits. Our qualitative analysis on
these vulnerabilities reveals that they originate from external
projects used in Chrome. We discuss a potential solution
addressing this threat in future work.

The second assumption ASS2 is purely syntactical
and might not cover all the cases of bug fixes, since it is
extremely hard to automatically understand the root cause of
vulnerabilities. The assumption also means that if a version
contains at least one line of responsible code, this version
is vulnerable. Together with the assumption ASS3, our
method might overestimate the vulnerable versions by
classifying safe code as buggy (type I error, false positive).
However, since most Chrome vulnerabilities are reported as
foundational, if we overestimate the vulnerable versions, the
reported error rate is a lower bound of the actual one.

Besides, a technical threat to construct validity may be
a buggy implementation of the method. To minimize
this problem, we employed a multi-round test-and-fix approach
where we ran the program on some vulnerabilities, manually
checked the output, and fixed the bugs we found. We
repeated this procedure until no more bugs were found. Finally,
we randomly checked the output again to ensure there was
no mistake.

Internal validity concerns the causal relationship between
the collected data and the conclusions drawn in our study.
Our conclusions are based on statistical tests. These tests
have their own assumptions. Choosing tests whose assumptions
are violated might lead to wrong conclusions. To
reduce the risk we carefully analyzed the assumptions of the
tests: for instance, we did not apply any tests with a normality
assumption since the distribution of vulnerabilities is not
normal.

External validity is the extent to which our conclusion 
could be generalized to other applications. Our experiment 
is based on the vulnerability data of Chrome. So, to have 
a more generalized conclusion, a replication of this work on 
other applications should be done. 

6. RELATED WORK 

Sliwerski et al. [12] proposed a technique that automatically
locates fix-inducing changes. This technique first locates
changes for bug fixes in the commit log, then determines
earlier changes at these locations. These earlier
changes are considered the cause of the later fixes, and
are called fix-inducing. This technique has been employed in
several studies [12, 14] to construct bug-fix data sets.
However, none of these studies mention how to address bug fixes
whose earlier changes could not be determined. These bug
fixes were ignored and became a source of bias in their work.

Bird et al. [2] studied the level of bias of techniques
that locate bug fixes in a code base. The authors gathered
a data set linking bugs and fixes in the code base for five open
source projects, and manually checked for biases in their
data set. They found strong evidence of systematic
bias in the bug-fixes in their data set. Such bias might also
exist in other bug-fix data sets, and could be a critical
problem for any study relying on such biased data.

Antoniol et al. [1] showed another kind of bias that
bug-fix data sets might suffer from. Many issues reported
in tracking systems are not actual bug reports, but
feature or improvement requests. Therefore, this might lead
to inaccurate bug counts. However, such bias rarely happens
for security bug reports. Furthermore, Nguyen et al. [6], in
an empirical study about bug-fix data sets, showed that the
bias in linking bugs and fixes is a symptom of the software
development process, not an issue of the technique used.
Additionally, the linking bias has a stronger effect than the
bug-report-is-not-a-bug bias.

7. CONCLUSION AND FUTURE WORK 

In this paper we have conducted an experiment to verify
the reliability of the vulnerable versions data of Chrome
vulnerabilities reported by NVD. The experiment has revealed
that the error in the vulnerable versions data is notable.
Among the verifiable vulnerabilities of individual Chrome
versions, approximately 25% are erroneous. If we assume
that all unverifiable vulnerabilities are correct, still
more than 7% are erroneous overall. We also demonstrated
how these erroneous vulnerabilities could potentially impact
the conclusions of a foundational vulnerability study. Another
study on the impact of erroneous vulnerabilities is further
discussed in Appendix A. This experiment has shed light
on the (un)reliability of the NVD, and allows researchers
to revisit the reliability of existing vulnerability databases.

However, about two-thirds of Chrome vulnerabilities are
unverifiable because they are vulnerabilities of the external
projects used in Chrome. To be able to verify them, extra
effort is required. First, we need to link the unverifiable
vulnerabilities to the bug IDs of the external projects. This
could be done by parsing the Chrome bug reports. Our
qualitative study on several unverifiable vulnerability reports
shows that all of them have links to bug reports of the external
projects. Second, we apply the proposed method to identify
vulnerable revisions of the external projects. Finally, we
link these vulnerable revisions to the versions of Chrome by
looking at the repository of Chrome. For example, Chrome
v12.0 uses WebKit revision 80695 and V8 revision 7138. A more
detailed discussion can be found in Appendix B.

Also as a part of future work, we plan to evaluate the 
robustness of the proposed method in identifying vulnerable 
revisions correctly. We also plan to repeat the experiment 
on other open source software like Firefox to have a better 
insight about the reliability of NVD. 
APPENDIX

A. VULNERABILITY DISCOVERY MODEL 
REVISIT 

This section analyzes to what extent the erroneous
vulnerabilities can affect the conclusions of the validation
experiment on vulnerability discovery models (VDM) [7]. Readers
are referred to [7] for full details of the experiment as well
as the metrics. Here we only sketch the experiment and the
conclusions that might be affected.

A.1 Validation Experiment Overview

Nguyen and Massacci [7] conducted an experiment to study
the performance of six VDM (i.e. AML, AT, LN, LP, RE,
and RQ^6) on six versions of Chrome (v1-v6) and other
browsers. That experiment fitted all VDM to the
monthly-observed vulnerabilities of each version from the sixth
month after release to the present (time of writing). The
goodness-of-fit of the VDM is assessed by goodness-of-fit tests.
Based on the returned p-values of these tests, the authors
calculated two quantitative metrics, namely goodness-of-fit
entropy E^(t), and goodness-of-fit quality Q_ω(t). These metrics
are used to analyze the stability of the data sets and the
overall performance of the VDM. Therefore, to study the impact
of the erroneous vulnerabilities, we replicate the experiment on
two data sets of verifiable vulnerabilities as below:

^6 The full names of these models are: AML - Alhazmi-Malaiya
Logistic; AT - Anderson Thermodynamic; LN - Linear; LP -
Logistic Poisson; RE - Rescorla Exponential; RQ - Rescorla
Quadratic.



    nvd.verifiable = $\{(cve, v) \mid V'(cve) \neq \bot\}$
    nvd.verified   = $\{(cve, v) \mid v \in V'(cve)\}$
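
As an illustration of the two data sets, the sketch below builds both from a small in-memory mapping. The function name, the dictionary layout, and the use of `None` to stand for an unverifiable vulnerability are assumptions made for this example. It also assumes that, in nvd.verifiable, v ranges over the versions that NVD claims vulnerable, which the definition leaves implicit.

```python
def build_datasets(nvd_versions, verified_versions):
    """Build (nvd.verifiable, nvd.verified) as sets of (cve, version) pairs.

    nvd_versions: cve -> set of versions NVD claims vulnerable.
    verified_versions: cve -> set of verified vulnerable versions (V'),
        or None when the vulnerability is unverifiable.
    """
    verifiable, verified = set(), set()
    for cve, claimed in nvd_versions.items():
        v_prime = verified_versions.get(cve)
        if v_prime is None:
            continue  # unverifiable: excluded from both data sets
        # All NVD-claimed versions of a verifiable vulnerability.
        verifiable.update((cve, v) for v in claimed)
        # Only the versions confirmed vulnerable by the verification.
        verified.update((cve, v) for v in v_prime)
    return verifiable, verified
```

Comparing the two returned sets for the same vulnerability shows directly which NVD version claims the verification did not confirm.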



A.2 Experiment Revisit 

We fit all VDM to the monthly observed vulnerabilities of
each of the fourteen versions of Chrome v1-v12, from the sixth
month after release to the present*. Then we calculate the
entropy and quality of each VDM in each data set.

Figure 8(a) plots the evolution of the entropy for both nvd.verifiable
(solid line) and nvd.verified (dashed line). Figure 8(b) shows
the distribution of the entropy of these two data sets. The
entropy of nvd.verifiable appears slightly greater than
that of nvd.verified. We perform a paired Wilcoxon test
to check the null hypothesis that "there is no difference
between the medians of the entropies of the two data sets". The
returned p-value of 0.04 (< 0.05) shows that the difference between
the entropies of the two data sets is statistically significant.
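
For readers unfamiliar with the test, a paired Wilcoxon (signed-rank) test can be sketched in plain Python as below. This is an illustrative version, not the procedure or software used in the experiment: the function name is an assumption, and the p-value uses the normal approximation without continuity or tie correction, so statistical packages may report slightly different values, especially for small samples.

```python
import math


def paired_wilcoxon(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w, p): w is the smaller of the two signed-rank sums and p is
    the two-sided p-value under the normal approximation.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return w, p
```

Here `x` and `y` would hold the entropy values of the two data sets at the same observation points; the null hypothesis is rejected at the 5% level when the returned p is below 0.05.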

Figure 9 plots the quality of the VDM in the two data
sets nvd.verifiable (solid lines) and nvd.verified (dotted
lines). The X-axis is the month since release (MSR), and the
Y-axis is the quality of VDM, i.e., the ratio of the number
of times the VDM fits the observed data well (p-value >
0.05) to the total number of times the VDM is fitted to the
observed data. The qualities of the five models AT, LN, LP,
RE, and RQ are mostly the same in the two data sets
during the first few MSR, but differ considerably
in later MSR. In contrast, the qualities of AML differ
considerably in the first few MSR, but get closer in later MSR.
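
The quality ratio described above amounts to a simple count. The sketch below follows the definition given in this appendix only; the function name and the list-of-p-values input are assumptions for illustration, and [7] should be consulted for the full definition of the metric.

```python
def quality(p_values, alpha=0.05):
    """Fraction of fits whose goodness-of-fit p-value exceeds alpha.

    p_values: one p-value per time the VDM was fitted to observed data.
    """
    if not p_values:
        raise ValueError("no fits to evaluate")
    good = sum(1 for p in p_values if p > alpha)
    return good / len(p_values)
```

For instance, a model fitted four times with p-values 0.5, 0.01, 0.2, and 0.04 (made-up numbers) fits well in two of the four cases, giving a quality of 0.5.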

We perform the paired Wilcoxon test for the qualities
of VDM. The null hypothesis is "there is no difference
between the quality of VDM in the two data sets". The returned
p-value of 0.15 (> 0.05) gives only weak evidence of a
difference between the qualities in the two data sets. We
additionally perform the paired Wilcoxon test for the quality of
each VDM individually, under the same null hypothesis.
Table 2 reports the p-values of these tests.
A p-value < 0.05 (bold value) denotes a significant
difference between the qualities of the VDM in the two data
sets. According to the table, the qualities of the four models
LN, LP, RE, and RQ differ significantly between the two data
sets, whereas the results for AML and AT are inconclusive.

In summary, the erroneous vulnerabilities significantly change
the entropy of individual VDM. For the quality, the impact
of the erroneous vulnerabilities differs across individual
VDM. The errors do not change the performance of AML and AT
much, whereas they significantly change the performance of
the LN, LP, RE, and RQ models. However, we have only weak
evidence of a difference in the overall quality of VDM
between the two data sets, and likewise for the entropy.
Therefore, the erroneous vulnerabilities of Chrome
partially change the performance of VDM as well as the
conclusions in [7].

Figure 8: The entropy E_i(t) of nvd.verifiable
and nvd.verified.
B. DEALING WITH BUG FIXES IN EXTERNAL PROJECTS



* The time of data collection, July 2012.
This figure plots the quality of VDM in the two data sets nvd.verifiable (solid lines) and nvd.verified (dotted lines). The X-axis
is the month since release, and the Y-axis is the quality of VDM, i.e., the ratio of the number of times the VDM fits the
observed data well (p-value > 0.05) to the total number of times the VDM is fitted to the observed data.

Figure 9: The quality of VDM in nvd.verifiable (solid lines) and nvd.verified (dotted lines).



Table 2: The p-values of the tests of VDM quality
in the two data sets.

The table reports the p-values of the tests of the null hypothesis
that "there is no difference between the quality of VDM
in the two data sets". Bold values (less than 0.05) denote
statistically significant differences.



            AML     AT      LN       LP       RE       RQ
p-value     0.32    0.35    < 0.01   < 0.01   < 0.01   < 0.01



